Vicinity-driven paragraph and sentence alignment for comparable corpora
Parallel corpora have driven great progress in the field of Text Simplification. However, most sentence alignment algorithms either support only a limited range of alignment types, or simply ignore valuable clues present in comparable documents. We address this problem by introducing a new set of flexible vicinity-driven paragraph and sentence alignment algorithms that capture 1-N, N-1, N-N and long-distance null alignments without the need for hard-to-replicate supervised models.
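The vicinity-driven idea can be illustrated with a minimal sketch (ours, not the authors' implementation): a greedy aligner that only searches a small window around the last aligned target position and emits a null alignment when nothing in the vicinity is similar enough. The `similarity` measure, `window` and `threshold` values here are illustrative assumptions.

```python
from collections import Counter

def similarity(a, b):
    """Cosine-like overlap between bag-of-words Counters of two sentences."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((wa & wb).values())
    total = (sum(wa.values()) * sum(wb.values())) ** 0.5
    return shared / total if total else 0.0

def vicinity_align(src, tgt, window=2, threshold=0.3):
    """Greedy alignment: for each source sentence, search only a small
    vicinity window around the last aligned target position; fall back
    to a null alignment when no candidate clears the threshold."""
    alignments, last = [], 0
    for i, s in enumerate(src):
        lo, hi = max(0, last - window), min(len(tgt), last + window + 1)
        best = max(range(lo, hi), key=lambda j: similarity(s, tgt[j]), default=None)
        if best is not None and similarity(s, tgt[best]) >= threshold:
            alignments.append((i, best))
            last = best + 1
        else:
            alignments.append((i, None))  # null alignment
    return alignments
```

Restricting the search to a vicinity is what keeps the method cheap while still permitting long-distance nulls: unmatched sentences simply do not advance the window.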
Semantic modelling of user interests based on cross-folksonomy analysis
The continued increase in Web usage, in particular participation in folksonomies, reveals a trend towards a more dynamic and interactive Web where individuals can organise and share resources. Tagging has emerged as the de-facto standard for the organisation of such resources, providing a versatile and reactive knowledge management mechanism that users find easy to use and understand. It is common nowadays for users to have multiple profiles in various folksonomies, thus distributing their tagging activities. In this paper, we present a method for the automatic consolidation of user profiles across two popular social networking sites, and subsequent semantic modelling of their interests utilising Wikipedia as a multi-domain model. We evaluate how much can be learned from such sites, and in which domains the knowledge acquired is focussed. Results show that far richer interest profiles can be generated for users when multiple tag-clouds are combined.
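The consolidation step that precedes the Wikipedia-based semantic modelling amounts to merging per-site tag clouds into one interest profile. A minimal sketch with made-up tag data (the site names and counts are illustrative, not from the paper):

```python
from collections import Counter

def consolidate_profiles(*tag_clouds):
    """Merge per-site tag clouds (tag -> frequency) into one profile,
    normalising case so 'Photography' and 'photography' count as one
    interest. Counter.update adds frequencies rather than replacing."""
    merged = Counter()
    for cloud in tag_clouds:
        merged.update({tag.lower(): n for tag, n in cloud.items()})
    return merged

# Hypothetical tag clouds from two folksonomies for the same user
flickr = {"travel": 5, "Photography": 9}
delicious = {"photography": 4, "python": 7}
profile = consolidate_profiles(flickr, delicious)
```

The merged profile is what a multi-domain model such as Wikipedia would then map onto topic categories; the richer combined tag-cloud is exactly why cross-folksonomy profiles outperform single-site ones.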
Joint Emotion Analysis via Multi-task Gaussian Processes
We propose a model for jointly predicting multiple emotions in natural language sentences. Our model is based on a low-rank coregionalisation approach, which combines a vector-valued Gaussian Process with a rich parameterisation scheme. We show that our approach is able to learn correlations and anti-correlations between emotions on a news headlines dataset. The proposed model outperforms both single-task baselines and other multi-task approaches.
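The low-rank coregionalisation at the heart of such models is the Intrinsic Coregionalisation Model kernel, K((x, d), (x', d')) = B[d, d'] · k(x, x'), where the task covariance B = WWᵀ + diag(κ) is kept low rank through W. A NumPy sketch under an assumed RBF data kernel; the values of `W` and `kappa` are illustrative, not learned:

```python
import numpy as np

def rbf(X, X2, lengthscale=1.0):
    """Squared-exponential kernel over the inputs."""
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def coregionalised_kernel(X, W, kappa, lengthscale=1.0):
    """Intrinsic Coregionalisation Model: K = B kron k(X, X), with the
    task-covariance matrix B = W W^T + diag(kappa) low rank in W."""
    B = W @ W.T + np.diag(kappa)   # (tasks, tasks)
    K = rbf(X, X, lengthscale)     # (n, n)
    return np.kron(B, K)           # (tasks*n, tasks*n)

X = np.array([[0.0], [1.0], [2.0]])
W = np.array([[1.0], [0.8], [-0.5]])   # rank-1 coupling of 3 emotions
kappa = np.array([0.1, 0.1, 0.1])
K = coregionalised_kernel(X, W, kappa)
```

Note how the off-diagonal entries of B (e.g. W[0]·W[2] = -0.5) encode exactly the correlations and anti-correlations between emotions that the abstract mentions.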
Multi-hypothesis machine translation evaluation
Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human references, we exploit the MT model uncertainty to generate multiple diverse translations and use these (i) as surrogates for reference translations; (ii) to obtain a quantification of translation variability that complements existing metric scores; or (iii) to replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality, by up to 15%.
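One way to picture the variability estimate: sample several translations from the model, measure how much they disagree, and optionally score a candidate against the samples as pseudo-references. A toy sketch using token-level F1 as a stand-in for a real metric; the function names and thresholds are ours, not the paper's:

```python
from itertools import combinations

def overlap(a, b):
    """Token-level F1 between two sentences (stand-in for a real MT metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    common = len(set(ta) & set(tb))
    if not common:
        return 0.0
    p, r = common / len(set(ta)), common / len(set(tb))
    return 2 * p * r / (p + r)

def variability(hypotheses):
    """Mean pairwise dissimilarity among sampled MT outputs: high values
    indicate the model is uncertain about this sentence (needs >= 2 samples)."""
    pairs = list(combinations(hypotheses, 2))
    return sum(1 - overlap(a, b) for a, b in pairs) / len(pairs)

def score_with_pseudo_refs(candidate, hypotheses):
    """Use sampled outputs as surrogate references: best match wins."""
    return max(overlap(candidate, h) for h in hypotheses)
```

Identical samples yield zero variability (a confident model), while lexically diverse samples push the estimate towards one; that scalar is what can complement or replace a reference-based score.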
Exact decoding for phrase-based statistical machine translation
© 2014 Association for Computational Linguistics. The combinatorial space of translation derivations in phrase-based statistical machine translation is given by the intersection between a translation lattice and a target language model. We replace this intractable intersection by a tractable relaxation which incorporates a low-order upper bound on the language model. Exact optimisation is achieved through a coarse-to-fine strategy with connections to adaptive rejection sampling. We perform exact optimisation with unpruned language models of order 3 to 5 and show search-error curves for beam search and cube pruning on standard test sets. This is the first work to tractably tackle exact optimisation with language models of orders higher than 3.
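The intuition behind the low-order upper bound can be sketched in a few lines: collapse the n-gram model into an optimistic per-word bound, so scores computed with it can never underestimate the true language-model score, making pruning against them admissible. This is only the admissibility intuition, not the paper's coarse-to-fine algorithm:

```python
def upper_bound_lm(trigram_logprobs):
    """Collapse a trigram model {(u, v, w): log p(w | u, v)} into an
    optimistic per-word bound: for each word, keep the best log-probability
    over all contexts. Scoring with these bounds only over-estimates the
    true LM score, so pruning against them never discards the optimum."""
    bound = {}
    for (u, v, w), lp in trigram_logprobs.items():
        bound[w] = max(bound.get(w, float("-inf")), lp)
    return bound

def bound_score(sentence, bound):
    """Optimistic LM score of a token sequence under the collapsed bound."""
    return sum(bound[w] for w in sentence)
```

In the paper's setting this relaxation replaces the intractable lattice/LM intersection; the coarse-to-fine search then tightens it only where needed.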
A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, as well as to NLP tools and other text processing tasks requiring bilingual data. This research proposes a language-independent sentence alignment approach based on experiments from Polish (a non-position-sensitive language) to English. The alignment approach was developed on the TED Talks corpus, but can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence recognition, some of which draw on synonyms and semantic text structure analysis as additional information. Minimisation of data loss is ensured. The solution is compared to other sentence alignment implementations, and an improvement in MT system score is shown for text processed with the described tool.
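A compact way to realise "minimisation of data loss" is a dynamic-programming aligner that maximises total sentence similarity while allowing explicit skips, so unmatched sentences are left out deliberately rather than silently. The similarity function is pluggable, which is where synonym- or semantics-aware scores would slot in. A sketch (ours, not the paper's code):

```python
def dp_align(src, tgt, sim):
    """Needleman-Wunsch-style DP over sentences: maximise total similarity
    with 1-1 matches plus skips (1-0 / 0-1), then backtrace the best path."""
    n, m = len(src), len(tgt)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:  # 1-1 match
                cands.append((score[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1]), (i - 1, j - 1)))
            if i:        # skip a source sentence
                cands.append((score[i - 1][j], (i - 1, j)))
            if j:        # skip a target sentence
                cands.append((score[i][j - 1], (i, j - 1)))
            # key on score only, so ties prefer the match listed first
            score[i][j], back[i][j] = max(cands, key=lambda c: c[0])
    pairs, i, j = [], n, m
    while i or j:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return pairs[::-1]
```

Unlike purely positional methods, nothing here assumes the two sides keep the same sentence order locally beyond what the DP path allows, which suits a non-position-sensitive source language.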
Deciding when, how and for whom to simplify
Current Automatic Text Simplification (TS) work relies on sequence-to-sequence neural models that learn simplification operations from parallel complex-simple corpora. In this paper we address three open challenges in these approaches: (i) avoiding unnecessary transformations, (ii) determining which operations to perform, and (iii) generating simplifications that are suitable for a given target audience. For (i), we propose joint and two-stage approaches where instances are marked or classified as simple or complex. For (ii) and (iii), we propose fusion-based approaches to incorporate information on the target grade level as well as the types of operation to perform in the models. While grade-level information is provided as metadata, we devise predictors for the type of operation. We study different representations for this information as well as different ways in which it is used in the models. Our approach outperforms previous work on neural TS, with our best model following the two-stage approach and using the information about grade level and type of operation to initialise the encoder and the decoder, respectively
Deep copycat networks for text-to-text generation.
Most text-to-text generation tasks, for example text summarisation and text simplification, require copying words from the input to the output. We introduce Copycat, a transformer-based pointer network for such tasks which obtains competitive results in abstractive text summarisation and generates more abstractive summaries. We propose a further extension of this architecture for automatic post-editing, where generation is conditioned over two inputs (source language and machine translation), and the model is capable of deciding where to copy information from. This approach achieves competitive performance when compared to state-of-the-art automated post-editing systems. More importantly, we show that it addresses a well-known limitation of automatic post-editing - overcorrecting translations - and that our novel mechanism for copying source language words improves the results
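The pointer mechanism behind such copy networks mixes a generation distribution with attention-weighted copying from the input. A minimal NumPy sketch of the mixture step; in a real model `p_gen` is predicted from the decoder state, whereas here it is a fixed scalar, and all values are illustrative:

```python
import numpy as np

def copy_distribution(p_vocab, attention, src_ids, p_gen):
    """Pointer-generator mixture: P(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * (attention mass on source positions holding w)."""
    final = p_gen * np.asarray(p_vocab, dtype=float)
    for pos, tok in enumerate(src_ids):
        final[tok] += (1 - p_gen) * attention[pos]
    return final
```

Because the copy term routes probability to whatever token actually occurs in the source, the model can emit words it could never generate from the vocabulary distribution alone, which is what makes the mechanism useful for summarisation and for deciding what to keep in post-editing.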
Probing the need for visual context in multimodal machine translation
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models from source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model